Accelerating Wright–Fisher Forward Simulations on the Graphics Processing Unit

Forward Wright–Fisher simulations are powerful in their ability to model complex demography and selection scenarios, but suffer from slow execution on the Central Processor Unit (CPU), thus limiting their usefulness. However, the single-locus Wright–Fisher forward algorithm is exceedingly parallelizable, with many steps that are so-called “embarrassingly parallel,” consisting of a vast number of individual computations that are all independent of each other and thus capable of being performed concurrently. The rise of modern Graphics Processing Units (GPUs) and programming languages designed to leverage the inherent parallel nature of these processors have allowed researchers to dramatically speed up many programs that have such high arithmetic intensity and intrinsic concurrency. The presented GPU Optimized Wright–Fisher simulation, or “GO Fish” for short, can be used to simulate arbitrary selection and demographic scenarios while running over 250-fold faster than its serial counterpart on the CPU. Even modest GPU hardware can achieve an impressive speedup of over two orders of magnitude. With simulations so accelerated, one can not only do quick parametric bootstrapping of previously estimated parameters, but also use simulated results to calculate the likelihoods and summary statistics of demographic and selection models against real polymorphism data, all without restricting the demographic and selection scenarios that can be modeled or requiring approximations to the single-locus forward algorithm for efficiency. Further, as many of the parallel programming techniques used in this simulation can be applied to other computationally intensive algorithms important in population genetics, GO Fish serves as an exciting template for future research into accelerating computation in evolution. GO Fish is part of the Parallel PopGen Package available at: http://dl42.github.io/ParallelPopGen/.

Another way to jump start the simulation is by assuming all extant populations are in mutation-selection balance at the beginning of the simulation. Under general mutationselection equilibrium (MSE), the proportion of mutations at every frequency in the population can be calculated via numerical integration over a continuous frequency diffusion approximation (see Kimura 1964). While this constrains the starting evolutionary state to mutation-selection equilibrium, this allows one to then start simulating the selection and demographic scenario of interest immediately. Due to current limitations of the MSE model in GO Fish, the mutation-selection equilibrium scenario does not, as of yet, include migration from other populations or random fluctuations in selection intensity -nor can the code calculate the number of generations ago a mutation at frequency x is expected to have arisen at. Instead all mutations in the initial mutation array said to have arisen at time t = 0. The model is detailed below: Using the glossary from Table 1, for any given population j at time t = 0: € µ = µ( j,0), s(x) = s( j,0, x), etc... From Kimura p. 220-222 (Kimura 1964 is the expected (mean) number of mutations at a given frequency, x, in the population at mutation-selection equilibrium. V(x) and M(x) are the contribution of drift and selection respectively to the rate of change of a mutation's frequency at frequency y in the population. Since this is an allele-based simulation, I use the equilibrium value of the effective number of chromosomes, N e , to account for inbreeding amongst N individuals.
The total rate of frequency change is the average of the rate of change of the effective haploid proportion of the population and the effective diploid proportion of the population weighted by F.
{ } e Substituting eq. 3 and 5 into eq. 1 yields: More familiar versions of eq. 6 can be derived by assuming neutrality or by assuming no frequency-dependent selection and either codominance or haploid/completely inbred individuals.
I approximate the integrals in eq. 6 using trapezoidal numerical integration and use the scan parallel algorithm implemented in CUB 1.6.4 (Merrill 2016) to accelerate the integration on the GPU*. λµ(x) is the expected (mean) number of mutations. To determine the actual number of mutations at a given frequency, x, I generate random numbers from the Inverse Poisson distribution with mean λµ(x) using the following procedure: I. Random number generator Philox (Salmon et al. 2011) generates a uniform random number between 0 and 1.
II. If λµ(x) ≤ 6, then that uniform variable is fed into the exact Inverse Poisson CDF. III. If λµ(x) > 6, then a Normal approximation to the Poisson is used.
Adding all the new mutations at every frequency to the starting mutation array is an embarrassingly parallel problem. Thus, combined with the parallel numerical integration for the definite integral components of eq. 6, initializing the simulation at mutationselection equilibrium is overall greatly accelerated on the GPU relative to serial algorithms on the CPU.

*An Aside About Numerical Precision, GPUs, and Numerical Integration:
For a bit of background, CPUs employ a Floating-point Processor Unit with 80-bits of precision for serial floating-point computation, which then quickly translates the result into doubleprecision (64-bit) for the CPU registers. Thus, CPU programs, including the serial Wright-Fisher simulation, are often written with double-precision performance in mind. In contrast, most consumer GPU applications are geared towards single-precision (32-bit) computation (e.g. graphics) and many consumer GPUs have relatively poor doubleprecision performance. More expensive, professional-grade workstation GPUs often have far better double-precision performance than their consumer counterparts. As the Wright-Fisher simulation does not actually require 64-bits of precision for its calculations, GO Fish has been written with 32-bits of precision computation in mind. This is even true of the MSE Integration step where the naturally pair-wise summation of parallel scanning mitigates the round-off error when performing large numbers of consecutive sums in 32bit (Higham 1993). That said, the mutation frequencies stored in the simulation have only single-precision floating-point accuracy. Experiments using CPU serial Wright-Fisher simulations showed consistent results between storing mutation frequencies with 32-bits vs. 64-bits of precision.

Steps to Calculate New Allele Frequencies
Migration, selection, and drift determine the frequency of an allele in the next generation, x t+1 , based on its current frequency, x t . Migration and selection are deterministic forces whereas drift introduces binomial random chance. While these three steps can, in principle, be done in any order, their order in the simulation is as follows: € m(k) = m(k, j,t), x t,k ≡ freq. of allele in pop. k at time t, x mig = x mig, j ≡ freq. of allele in pop. j after migration, Fish uses a conservative model of migration (Nagylaki 1980). The new allele frequency in population j is the average of the allele frequency in all the populations weighted by the migration rate from each population, to population j. And the migration rate is specified by the proportion of chromosomes from population k in population j.

II. Selection (with Inbreeding)
In population j at time t: € x mig = x mig, j ≡ freq. of allele after migration, y mig = 1 − x mig , x mig, sel = x mig, sel, j ≡ freq. of allele after migration and selection, P AA ,P Aa ,P aa ≡ frequency of genotype AA, Aa, and aa, Like with M(x) in eq. 4, w and n are a weighted average of the effective haploid (inbred) and diploid (outbred) portions of the chromosome population. Diploid genotype frequencies assume random mating and Hardy-Weinberg equilibrium (Hardy 1908, Weinberg 1908.
Following the same logic as above: Substituting eq. 11d and 12b into eq. 10 yields: Again, like for eq. 6, more familiar forms of eq. 13 may be derived under certain assumptions such as neutrality, haploid/inbred individuals, and completely outbred diploids.

III. Drift (with Inbreeding)
For population j in generation t: The variable x mig,sel represents the expected frequency of the allele in generation t+1.
Drift is the random deviation of the actual frequency of the allele from this expectation.
To determine the actual frequency of the allele in the next generation, x t+1,j , I generate random numbers from the Inverse Binomial distribution with mean N e x mig,sel and variance N e x mig,sel (1-x mig,sel ) using the following procedure: I. Random number generator Philox (Salmon et al. 2011) generates a uniform random number between 0 and 1. II. If N e x mig,sel ≤ 6, then that uniform variable is fed into the exact Inverse Poisson CDF as an approximation to the Binomial. III. If N e x mig,sel > 6, then a Normal approximation to the Binomial is used.

Adding New Mutations
Using the glossary from Table 1, for population j in generation t: The Poisson Random Field shares an important assumption with Watterson's infinite sites model in that regardless of how many sites are currently polymorphic, mutations will never strike a currently polymorphic site and the number of monomorphic sites that a mutation can occur at is always the total number of sites, L (Watterson 1975, Sawyer andHartl 1992). Eq. 14 defines the expected number of mutations in population j for generation t+1. The actual number of new mutations is drawn from the Inverse Poisson distribution using the same procedure detailed in Simulation Initialization. New mutations can be added to generation t+1 in parallel and simultaneously with the new frequency calculations. Each new mutation is given a 4-part unique ID consisting of the thread and compute device that birthed it (if more than one graphics card is used) as well as the generation and population in which it first arose.

Compact
The general compact algorithm is outlined in Figure 2C). GO Fish's version uses a more advanced variant to speed up compaction and lower its memory requirement as discussed in (Billeter et al. 2009) and adapted from (Bakunas-Milanowski et al. 2015).